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ABSTRACT 

Chromatin domain boundary elements prevent in- 
appropriate interaction between distant or closely 
spaced regulatory elements and restrict enhancers 
and silencers to correct target promoters. In spite of 
having such a general role and expected frequent 
occurrence genome wide, there is no DNA 
sequence analysis based tool to identify boundary 
elements. Here, we report chromatin domain 
Boundary Element Search Tool (cdBEST), to 
identify boundary elements. cdBEST uses known 
recognition sequences of boundary interacting 
proteins and looks for 'motif clusters'. Using 
cdBEST, we identified boundary sequences across 
12 Drosophila species. Of the 4576 boundary seque- 
nces identified in Drosophila melanogaster genome, 
>170 sequences are repetitive in nature and have 
sequence homology to transposable elements. 
Analysis of such sequences across 12 Drosophila 
genomes showed that the occurrence of repetitive 
sequences in the context of boundaries is a 
common feature of drosophilids. We use a variety of 
genome organization criteria and also experimental 
test on a subset of the cdBEST boundaries in an 
enhancer-blocking assay and show that 80% of 
them indeed function as boundaries in vivo. These 
observations highlight the role of cdBEST in better 
understanding of chromatin domain boundaries in 
Drosophila and setting the stage for comparative 
analysis of boundaries across closely related 
species. 



INTRODUCTION 

Eukaryotic genomes contain a large number and variety 
of regulatory elements that control the cell type and 



context dependent expression patterns of genes. Much of 
the genome that does not code for proteins, contains these 
regulatory elements often in the close proximity of the 
target gene but frequently also at far away locations. 
Specific and appropriate interaction of regulatory 
elements governs the complex and regulated expression 
of genes. However, much of the mechanism involved in 
this process remains to be understood and, in particular, it 
is far from clear how the specificity of interactions among 
regulatory elements is achieved. It is known that given 
access by bringing together in transgenic context or by 
chromosomal rearrangements, enhancers as well as silen- 
cers can act on almost any promoter. It is also known that 
expression of a transgene construct in different independ- 
ent transgenic lines often depends on the site of insertion 
in the genome, a phenomenon referred to as position 
effect, due to the influence of local regulatory elements 
on the transgene. These observations suggest that in 
order to restrict infidel and distantly located elements to 
appropriate target promoters the genome is subdivided 
and highly organized by means of functionally independ- 
ent 'chromatin domains' (1). 

Critical to this chromatin domain model are chromatin 
domain boundary elements, the DNA sequences that 
define the borders of chromatin domains and capable of 
blocking enhancer-promoter interactions. Chromatin 
domain boundary elements were first identified in 
Drosophila melanogaster, as specialized chromatin struc- 
tures that border the two heat shock genes in 87A7 heat 
shock locus. The DNA sequences that are responsible for 
the bordering effect were named as scs and scs' (specialized 
chromatin structures), and they became the first molecu- 
larly defined boundary elements (2). Thereafter, several of 
boundary elements have been identified in D. melanogaster 
at various genomic locations, the notable ones are gypsy, 
Mcp, Fah-7 and Fab-8 (3-8). The gypsy boundary is part 
of naturally occurring gypsy retrotransposon and the 
Mcp, Fah-7 and Fah-8 boundary elements are present 
within the bithorax complex [BX-C] of Drosophila and 
are required for domain specific expression of Abd-B 



*To whom correspondence should be addressed. Teh +91 40 27192658; Fax: +91 40 27160591; Email: mishra(a;ccmb.res.in 
© The Author(s) 2012. Pubhshed by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecomnions.org/licenses/ 
by-nc/3.0). which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original worlt is properly cited. 



4386 Nucleic Acids Research, 2012, Vol. 40, No. 10 



gene. Apart from Drosophila, the boundary elements were 
also identified in variety of organisms ranging from yeast 
to human (9-11). 

Experimental identification of boundary elements 
in Drosophila involves the two main functional assays. 
First, the enhancer blocker assay where the reporter 
gene is driven by a minimal promoter and a strong 
enhancer and when a boundary element is placed 
in-between the enhancer and promoter, the expression 
level of reporter gene is eliminated or reduced (12). The 
second assay is known as insulation from position effect 
where a reporter gene is flanked by boundary elements 
that to insulate from chromosomal position effects leading 
to uniform levels of expression in independent transgenic 
hues (13). Though, these assays have been successfully 
used for various boundary elements, they have an inherent 
disadvantage of having to produce transgenic animal hnes 
and as a result, these assays cannot be applied for 
genome-wide screening. As an alternative to testing in trans- 
genic flies, recently boundary elements have been assayed in 
Drosophila cells (14,15). More recently, a ceU culture based 
barrier assay has been used where ability of a test DNA to 
prevent spread of repressive chromatin has been assayed 
(16). These testing methods also involve making of 
constructs and establishing ceU lines, and, therefore, limit 
their applicability at genome level analysis. Based on the 
functional assays used, boundary elements are also 
referred as enhancer-blocking insulator or barrier element. 

Though it is accepted that boundary elements divide 
the genome into domains, the mechanism by which they 
establish independent functional domains remains unclear 
and various models that have been proposed so far, 
suggest that they may use more than one mechanism 
(10,17). All the models agree that boundary elements 
exert their function through interaction with various 
DNA-binding proteins and associated factors. To date, 
in Drosophila six DNA-binding proteins are shown to 
directly interact with boundary elements. These include 
BEAF, Zw5, GAGA Factor (GAF), Su(Hw), dCTCF 
and recently identified Elba factor (18-27). A boundary 
element may require one or more of these DNA-binding 
proteins to function as a boundary in vivo (21,26,27). 

Recent studies have reported the genome-wide binding 
profiles of various boundary binding proteins using 
ChlP-on-chip approach (28-32). Although these binding 
profiles can precisely map the in vivo binding sites of indi- 
vidual proteins, it is not clear how far these individual 
protein binding profiles can define functional boundaries 
in the genome since they define only the binding sites of 
proteins rather than defining complete boundary 
sequence. The experimentaUy identified boundary size 
varies from 431 bp to ~2.5kb (Supplementary Table SI). 
Moreover, these proteins are also involved in other 
nuclear functions such as transcriptional activation or 
silencing apart from their boundary function (28,33). 
In vivo binding site analysis of a boundary interacting 
factor does not necessarily indicate an associated 
boundary function. 

Here, we report a bioinformatics tool, chromatin 
domain Boundary Element Search Tool (cdBEST) for 
identification of potential boundary sequences in 



Drosophila. Using cdBEST, we identified 93109 bound- 
aries across 12 Drosophila species. Our approach identified 
4576 boundaries for D. melanogaster, including several 
repetitive boundaries. We also analysed these boundaries 
for their context in terms of flanking genes and experimen- 
tally tested 19 cdBEST boundaries in Drosophila S2 cells 
for enhancer-blocking activity and found that great 
majority of these cdBEST boundaries indeed function as 
boundaries in vivo. 



MATERIALS AND METHODS 

Sequence data 

The known boundary sequences were retrieved from 
National Center for Biotechnology Information (NCBI) 
(http://www.ncbi.nlm.nih.gov/) by foUowing the informa- 
tion provided in published hteratures (Supplementary 
Table SI). We used D. melanogaster genome release 
version 5.2 of NCBI for boundary prediction. The foUow- 
ing are the accession numbers for various chromosomal 
arms of D. melanogaster that were used in this study: 
NT_033777.2 (3R), NT_037436.3 (3L), NT_033778.3 
(2R), NT_033779.4 (2L), NC_004354.3 (X) and 
NC_004353.3 (4). 

CdBEST 

Calculation of motif frequencies and fold enrichment 
values. To calculate genome level motif frequencies (fg), 
we searched for the occurrences of individual boundary 
motifs {mi—m„) in the Drosophila genome and divided 
the total number of occurrences with genome size. 
Similarly a boundary level frequency (//,) was also 
calculated by considering boundary sequence alone for 
each boundary motif (see Supplementary Table S2 for 
details). The ratio between genome level frequency and 
boundary level frequency for a given motif m, is defined 
as fold enrichment value (FEV) for that particular motif 

FEV of motif ntj (FEVmi) — fi,{mj) /fg{mi) 

Boundary score. We calculated the boundary score by 
taking the overall summation of motif occurrences 
(Om,_„) and their multiplication with FEV {FEVmi_^. 

n 

Boundary score = ^ (Om; X FEVmi) 

/=i 

Implementation and availability of cdBEST. The cdBEST 
algorithm is implemented in Perl as two variants, the 
cdBEST basic and advanced. The basic version is a 
command line script requires Perl alone, can be used in 
Linux and Mac computers directly without any additional 
requirements. cdBEST_basic uses FASTA formatted 
sequence files for boundary search and produces text 
output files. The advanced version has a simple GUI 
(graphical user interface), where user can choose various 
parameters for searching boundaries. In addition to text 
files, this version produces an image output showing 
boundaries and gene annotations together with the scale 
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for easy correlation. cdBEST advanced requires Perl-Tk 
module for generating GUI and BioPerl modules (34) to 
draw image output files. In addition cdBEST search 
results can be uploaded to various genome browsers 
such as FlyBase GBrowse and UCSC genome Browser 
as custom tracks to view along with any other feature. 
The cdBEST source file for different OS platforms with 
a readme file is available for download at http://www 
.ccmb. res. in/rakeshmishra/cdBEST. html. 

Drosophila S2 cell based enhancer blocker assay 

To make NPG vector (neomycin PE enhancer GFP) con- 
struct, we used pGreen H-Pelican vector as starting 
material (35). We assembled the final NPG construct 
(Figure 4A) in three steps, (i) Vector backbone: by 
Drain digestion, we removed white gene and rehgated 
vector, (ii) Neomycin-hsp70 promoter: we assembled the 
neomycin gene (amplified from pEGFP) and hsp70 
promoter in pBluescript vector. Using Apal/Kpnl 
double digestion, this cassette was removed and cloned 
in vector backbone, (iii) PE enhancer: we amplified PE 
enhancer {twist gene's Proximal Element enhancer) from 
the Drosophila genomic DNA using specific primers 
carrying Kpnl sites, and cloned in vector backbone (36). 
In the final NPG construct the test fragments can be 
cloned in Notl and BamHl sites. We PGR amplified 
various test fragments (known boundaries and predicted 
boundary elements) and cloned in pGEM-T Easy vector. 
Using Notl sites that are present in pGEM-T Easy vector, 
we removed the inserts and cloned to NPG vector. 

We cultured Drosophila S2 cells in Schneider's Insect 
Medium (supplemented with 10% serum). We used one 
microgram of Qiagen column purified DNA to transfect 
1ml of cells (1x10^ cells/ml). Effectene transfection 
reagent kit (Qiagen) was used for transfections. After 
36 h of transfection we added G418 to cells with the 
final concentration 1 mg/ml. Once in every 4 days, we 
replaced the old media with fresh media (containing 
G418). After 6 weeks of culturing, we assayed the cells 
for enhancer-blocking activity. Flow cytometry analyses 
were done on MoFlo cell sorter using 0.5 ml of cells and 
20 000 events were scored. 

Polycomb and H3K27me3 data analysis 

We downloaded the Polycomb and H3 me3K27 
ChlP-chip data from ArrayExpress database (accession 
code E-MEXP-535) (37). Using Perl scripts and FlyBase 
coordinate converter, we converted release four coordin- 
ates to release five coordinates and extracted the data for 
regions of our interest and written the data in GFF 
format. By using BioPerl modules we generated image 
files in which gene information and predicted boundaries 
are plotted along with ChlP-chip data and manually 
counted the boundaries that lie very close to borders of 
strong PcG sites. 

Fly Atlas gene expression profile comparison 

We downloaded FlyAtlas data set containing the expres- 
sion data for 17 adult tissues, eight larval tissues and S2 
ceU line (38). This data set has the expression profiles 



for 18 880 probes which represents Drosophila 18 500 tran- 
scripts covering around 13 500 genes. As data is referenced 
based on Probe Set IDs, we added gene names, start, stop 
positions and FlyBase IDs to each probe by following an 
annotation file (Z)ra.vo/j/i;7a_2.na28.annot.csv). Using a 
Perl script and chromosome specific gene lists (prepared 
from release 5.2 GenBank files) we separated out the data 
for each chromosome and wrote it as separate files. Then 
we mixed the predicted boundaries to this chromosome 
specific data and sorted the data using start positions. 
Using another Perl script, we extracted the gene pairs 
that flank predicted boundaries and compared their 
expression profiles. Here we called a gene pair as 'differ- 
entially expressed' if their expression change direction is in 
negative correlation (i.e. the change direction is 'Up' 
verses 'Down' or 'Down' verses 'Up') for one or more 
tissues. In order to make sure that the 'Up' change direc- 
tion of a gene reflects its presence, we used Affymetrix 
caU Presence/ Absence values. We considered Affymetrix 
Presence/ Absence call value 0 or 1 as a mark for absence, 
value 3 or 4 as mark presence of transcripts. 

RESULTS 

cdBEST 

The experimentally identified boundary elements in 
Drosophila share common functional properties; however 
they do not posses any significant sequence similarity. In 
general, c«-regulatory elements are often enriched by 
small sequence motifs. Boundaries elements are no excep- 
tion to this phenomenon and they do have several such 
motifs that serve as interacting sites for boundary inter- 
acting proteins (Table 1). With the objective to identify 
new boundary elements, we analysed the distribution 
of these motifs in the euchromatic portion of the 
D. melanogaster genome. The DNA motifs we analysed 
include BEAF (19), Zw5 (20), GAF (21), Su(Hw) (23,24), 
Elba (27), CTCF (31,32,39) and Fab-7 Motif (F7M) 
(Mishra,R.K., unpubhshed data). The distribution 
pattern shows that the smaller motifs, BEAF, Zw5 and 
GAF are more randomly distributed in the genome, but 
occur as clusters in boundary regions (Table 1). The larger 
motifs hke Su(Hw), CTCF are non-randomly distributed 
in the genome and highly enriched in the boundary regions. 

Based on motif clustering and enrichment, we divided 
the boundaries into five types: Fab-7 type, Fab-8 type, 
SCS type, SCS'-BE28 type and gypsy type (Table 2). 
For each boundary type, we evaluated for criteria like, 
the total number of motifs, predominant motif(s) and 
the average gap between motifs. Taking inputs from 
these analyses we built a boundary element search tool, 
cdBEST that looks for boundary type specific motif 
clusters under a defined set of constraints (Table 2). We 
included a scoring method in this tool to eliminate false 
predictions. The boundary score for a given sequence is 
calculated by the overall summation of motif occurrences 
multiplied by their respective FEV. In general FEV of 
motifs are calculated by comparing their occurrence 
frequencies in positive verses background data sets 
(40^2). We used known boundary regions as positive 
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data set and Drosophila genome sequence as a background 
data. The calculated FEV (Table 1) are incorporated 
in the tool to derive the boundary score. We set 
boundary type specific minimum required scores in 
cdBEST (Table 2). 

To test the tool for its efficiency, we used set of known 
boundary sequences (Supplementary Table SI), cdBEST 
picked up 10 of the 11 boundary sequences as hits with 
varying scores (Supplementary Table S3). SFl was the 
only boundary that failed to a give hit because of its 
poor motif content. The gypsy boundary achieved a 
highest score of 1722.6, while the SCS boundary 
received the lowest score 43.97. cdBEST did not yield 
even a single hit when regulatory element sequences such 
as enhancers and polycomb response elements (PREs) 
were used as input (Supplementary Table S4) indicating 
the accuracy of the tool. A test run on a sequence region 
that covers Drosophila Bithorax complex (BX-C) identifies 
12 boundaries including previously known Mcp, Fab-6, 
Fab-7 and Fab-8 boundaries (Figure 1 and 
Supplementary File SI). 

Whole genome analysis for boundaries in 
D. melanogaster 

Drosophila melanogaster genome release 5.2 was used as 
input for boundary search. Each chromosome was separ- 
ately analysed using 750 bp as set window size and 10 bp 
as window slide. Under these conditions, we retrieved 
4576 boundaries in the whole genome (Table 3 and 
Supplementary File S2). cdBEST correctly identified the 
reported boundaries, even-skipped, TER94, Abd-Bm and 
myoghanin-eyeless (ME), which were not included in our 
positive data set (43-46). The average domain size 
deciphered by predicted boundaries varies from 19 to 
31 kb for various chromosome arms. Density of predicted 
boundary was greater on the X chromosome despite 
having a low gene density and moderate size. 

Boundaries with repetitive occurrence are associated 
with transposable elements 

To find multicopy or repetitive boundaries in the 
D. melanogaster genome, we carried out BLAST 
sequence ahgnnients among the predicted boundary 
sequences. We used an identity of >90% over a stretch 
of 100-bp sequence to call repetitive boundaries in the 
genome. Among the 4576 predicted boundaries, we 
retrieved 55 groups of repetitive boundaries containing 
239 individual elements. The number of boundaries 
within a group ranges from 2 to 39 (Table 4 and 
Supplementary Table S5). We also found the known 
gypsy boundary in the Hst with two copies. This led to 
the assumption that many of these multicopy boundaries 
may be associated with transposable elements. To test this, 
we compared these repetitive boundary sequences with 
known transposable element sequences from databases, 
FlyBase and Repbase (47,48). Of the 239 multicopy 
boundaries that are found in the D. tnelanogaster 
genome, 173 showed significant sequence similarity 
(>90% identity over 100-bp sequence) with transposable 
elements. Drosophila melanogaster has 96 known families 



of transposable elements that covers the ~5% of the eu- 
chromatic part of the genome (49,50). Out of 173 
boundaries of repetitive nature that are identified by 
cdBEST 110 boundaries maps to nine of these famihes 
(Table 4 and Supplementary Table S5), indicating that 
only a small subset of transposable elements have 
boundary activity. 

Application of cdBEST in other Drosophila species 

Encouraged by the performance of cdBEST in 
D. melanogaster genome, we wanted to extent cdBEST 
prediction to 11 sequenced non-melanogaster species 
(50). Considering the evolutionary closeness of these 
species we expected that the boundary elements to be 
conserved and cdBEST might be able to pickup 
boundaries across these species. First, we asked whether 
cdBEST can predict prominent boundaries, such as Fab-7 
and Fab-8 in these species. For this, we used region(s) that 
covers Bithorax complex to predict boundaries and found 
hits that are orthologous to these two boundaries in all 
Drosophila species except grimshawi, where Fab-8 alone 
was predicted (Supplementary Figure SI). As cdBEST 
correctly recognizes these two test boundaries in 10 out 
of the 11 species, we apphed cdBEST for genome-wide 
boundary prediction. We downloaded the assembled 
genome sequence of these 11 species from FlyBase 
(FB2011_05 Release) and screened for contigs that are 
>200kb (47). Each genomic chromosome/contig/scaffold 
was subjected to boundary search and the total number of 
boundaries was counted using an automated script. 
cdBEST identified 88533 boundaries for these 11 
non-melanogaster Drosophila species. The entire predic- 
tion data can be downloaded from our website (http:// 
www.ccmb.res.in/rakeshmishra/cdBEST.html). Some 
species show very high number of boundaries when 
compared to other species (Figure 2 and Supplementary 
Table S6). Drosophila mojavensis has the highest number 
of predicted boundaries (15781) among all 12 species 
in-spite of not having the largest genome size. We also 
searched for the occurrence of repetitive boundaries in 
each of these species and extended the transposable 
elements verses repetitive boundaries comparison. In the 
end, we found large number of boundaries that are 
associated with transposable elements across these 
species (Supplementary Table S7). As shown in Figure 2, 
the percentage of repetitive boundaries closely follows the 
repeat contents of these 12 Drosophila genomes (50). 
D. ananassae and D. virilis are the only two species that 
are having higher repeat boundary percentage than their 
overall repeat content. This may be because of species 
specific repeat sequences with boundary potential are 
present in these species. The highest copy-number 
boundary (2702 copies), is indeed a species specific 
repeat sequence of D. ananassae. 

Epigenomic context of boundaries identified by cdBEST 

Boundary elements that mark the borders of repressive 
domains. During the early embryonic development, the 
active and inactive chromatin regions are marked by 
specific post-translational histone modifications in a 
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Table 1. Motif frequency and fold enrichment in boundary compared to whole genome 



S. No. 


Motif 


A/Tnttf 'ipniipnpp^ 


Boundary level frequency 
Boundary Occurrence" 


frpn 1 ipiipv*"'^ 


Pnl H pn ri p1i in pn t 
in hni inH ;i ript; 

111 L'U LlllUlCll lv^k3 


1 


BEAF 


CGATA 


SCS' and BE28 


9.15 


1.292 


7.09 


2 


Zw5 


GCTGMG 


SCS 


5.03 


0.963 


5.23 


3 


GAF 


GAGAG 


Fab-7 


4.89 


1.344 


3.64 


4 


Su(Hw)-Ml 


YRYTGCATAYYY 






0.022 


156.55= 




Su(Hw)-M2 


YWGCMTACTTHY 


(2L-203)'' 


3.47 


0.022 


156.55 


5 


Elba 


MCAATAAG 


Fab-7 and Fab-8 


0.99 


0.069 


14.21 


6 


CTCF-Ml 


MHRGRKGKCGCY 


Fab-8 


2.49 


0.016 


150.91 




CTCF-M2 


YAGRKGKCGC 


Fab-8 


1.25 


0.020 


61.58 




CTCF-M3 


RRCGCCMYCYRKY 


Fab-8 


1.25 


0.008 


165.67 


7 


Fah-7motif 


CCAATTGG 


Fab-7 


1.63 


0.022 


73.02 



"Motif names are defined based on the binding protein for the purpose of computer searching. Ml, M2 and M3 are alternative or additional binding 
motifs of the protein. ''lUPAC code. 'Occurrence per kb. ''Whole genome used here includes only the Euchromatic regions (X, 2L, 2R, 3L, 3R and 4) 
of release 4.1. "Assigned based on the value obtained for Su(Hw)-M2 Motif, as both have similar genomic frequencies. ''2L-203, 3.09, 3.28, 2L-203, 
X-103 and y-45. 



Table 2. Five boundary types and prediction criteria for new boundaries 



Boundary type Motif cluster Boundary mapping criteria 







Specific feature 


Motif/gap" 


Score 


1. Fab-7 


GAF[6], Elba[l], F7M[2], Zw5[2] 


>2 kinds of motifs 


8/90 


60 


2. Fab-8 


CTCF-M1[2], M2[l], M3[l], GAF[2], Elba[l], BEAF[1] 


>1 CTCF motif(s) 


2/90 


75 


3. SCS 


Zw5[9], BEAF[3] 


Two Zw5 motifs'' 


8/90 


37 


4. SCS' 


BEAF[8], Zw5[2] 


Six BEAF motifs" 


6/90 


43 


BE28 


BEAF[7], Zw5[2] 








5. Gvpsy 


Su(Hw)-Ml[ll], M2[l] 


>2 Su(HW) motifs 


2/125 


313 


21-203 


Su(Hw)-Ml[2], M2[3], CTCF[1], Zw5[l], BEAF [1] 








X-103 


Su(Hw)-Ml[3], M2[3] 









Motifs in bold are the predominant/experimentally tested motifs in a particular boundary type, numbers in bracket indicate their occurrences. 
"Motifs/gap combination shows the number of total motifs required and with allowed average gap. ''Here two high affinity motifs (Zw5 motif flanked 
by next Zw5 motif with 13 bases as maximum allowed gap) are required. "Here two high affinity motifs (BEAF motif flanked by next BEAF motif 
with 16 bases as maximum allowed gap) are required. 



tissue or cell type manner (51). The marked histone status 
is further inherited or maintained in subsequent gener- 
ations with the help of Polycomb and trithorax group of 
proteins (51-53). The regions that are covered by 
Polycomb proteins and marked by H3K27me3 modifica- 
tions are shown to form repressive domains in the genome 
(53,54). As Polycomb proteins can spread on chromatin 
over a considerable distance and repress the gene activity, 
boundary elements are thought to be involved in limiting 
the spread of repressive chromatin and define the borders 
repressive of domains (54-57). In this background, we 
analysed an available Polycomb binding data (37) 
derived from Drosophila S2 cells and asked whether 
cdBEST can locate boundaries that can limit the 
Polycomb spreading or define the borders of repressive 
domains. We used a data set that consist 95 strong PcG 
sites as repressive domains. The ninety five repressive 
domains yielded 190 border regions for analysis. Overall, 
of the 190 border regions we analysed, 108 regions have 
well positioned boundaries (data not shown). 

Here, we show two specific regions as examples and 
discuss them in detail. The first region, NK homeodomain 
region, has a repressive domain marked by H3K27me3, 



containing CG31179 and CI 5 genes that are transcription- 
ally silent in S2 cells. However, the repressive domain is 
immediately flanked by hypomethylated regions that 
contain transcriptionally active genes CG7922 and 
CG7956. cdBEST predicts boundary 3R_591 and 
3R_592 and mark the limits of the repressive domain 
(Figure 3A). The second region is dco-SoxlOOB region 
where a presumptive PRE lies between the two divergently 
transcribed genes (Figure 3B). PRE can spread silencing 
effect to long distance and one of the possible ways by 
which nearby genes are protected from the silencing 
effects could be the intervention of a boundary between 
the active and silent regions. cdBEST finds boundary 
3R_913 near dco promoter. It is interesting to note that 
boundaries closely associated with promoters have been 
reported earlier (43). 

Boundaries that separate domains of differentially 
expressed genes 

As boundary elements can organize neighbouring genes 
into independent chromatin domains, one would expect 
that the 5'- and 3'-flanking genes of a boundary element 
may have independent expression profiles. To test this. 



4390 Nucleic Acids Research, 2012, Vol. 40, No. 10 



CTCF-N 
CTCF-C 
CP 190 
Mod(mdg4) 

GAF 
Su(Hw)-1 
Su{Hw)-2 
BEAF 



I I 
I 



I 
I 
II 



I 

I 

I i 



I; 
l! 



Known boundaries 
BX-C 

Transcripts 
cdBEST boundaries 



12480 kb 



-+- 



ll-> 

CG31498 



I 

I 

I 
I 



rTTiiTTT 



I 1 1 

I i I II I 



Ubx 
□ 



C631217 

i I 

422 423 



CG31275 



424 



Gluts 



abd-A 
HH 



I I 
I 



425 



I 

426 



I 

II 



II 
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I I I 



I I I I 

Mbp Fab-6 Fab-7 Fab-8 
-I- 
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1 Ct31034S 
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Figure 1. cdBEST analysis for the boundaries in the Drosophila Bithorax Complex. A 320 kb region of chromosome 3R, which consist of the 
BX-Complex is drawn according to scale. The upper yellow panel shows the in vivo binding profiles of various boundary proteins [plotted using a 
data from the recent study (64)]. The known boundaries were mapped and shown as red boxes. The lower panel shows annotated genes and cdBEST 
predicted boundaries with boundary numbers (corresponding to chr3R prediction). Dashed vertical lines show alignment of the cdBEST predictions 
against in vivo binding profiles of boundary interacting proteins. 



Table 3. Whole genoine analysis for boundary elements using cdBEST 



Chromosome arm 


Size (bp) 


No. of 
boundaries 


Boundary frequency 
[per lOOkb] 


No. of genes 


Gene density 
[genes/100 kb] 


Average 
domain size" 


Average genes 
(kb) per domain 


2L 


23011 544 


784 


3.41 


2766 


12.0 


29.4 


3.5 


2R 


21 146 708 


830 


3.92 


3088 


14.6 


25.5 


3.7 


3L 


24 543 557 


793 


3.23 


2848 


11.6 


31.0 


3.6 


3R 


27 905 053 


953 


3.42 


3547 


12.7 


29.3 


3.7 


4 


1 351 857 


52 


3.85 


90 


6.7 


26.0 


1.7 


X 


22422 827 


1164 


5.19 


2314 


10.3 


19.3 


2.0 


Whole genome 


120 381 546 


4576 


3.80 


14653 


12.2 


26.3 


3.2 



"Domain size was calculated by dividing the chromosome size with number of boundaries. 



Table 4. Transposon associated multicopy boundary elements in 
D. melanogsater 



S. No. 


Predicted 


Number 


Associated 


Predominant 




boundary 


of copies 


transposon 


motif(s) 


1 


X_52 


39 


Doc 


GAF 










Elba 


2 


2L 14 


23 


blood 


BEAF 


3 


2R_83 


8 


Rtla 


CTCF 










GAF 


4 


4_2/4_3 


12 


GATE 


BEAF 










CTCF 


5 


X 1143 


7 


G-element 


BEAF 


6 


2L 768 


6 


Rtlb 


CTCF 


7 


4_48 


5 


TART-A 


BEAF 










GAF 


8 


2L 86 


5 


mdg3 


BEAF 


9 


X_921 


5 


297 


GAF 










F7M 



we compared the expression profiles of 5'- and 3'-flanking 
genes of predicted boundary elements using the publicly 
available FlyAtlas data set (38). On the basis of intergenic 
and gene flank criteria, we have shorthsted 2559 
boundaries for this comparison. For each boundary we 



extracted 5'- and 3' -flanking genes using a gene table 
and compared their expression profiles using a Perl 
script. The expression profile includes the data for 17 
adult tissues, eight larval tissues and a cell Une (S2 cells). 
We termed two genes as 'differentially expressed', if their 
expression profiles are in negative correlation for one or 
more tissues (See 'Materials and Methods' section for 
details). Further 497 boundaries were removed from the 
hst as their flanking gene's expression profile is not avail- 
able in FlyAtlas data set. In the end, we have isolated 1545 
boundaries that are having differentially expressed gene 
pairs as flanking genes (Supplementary File S3). These 
1545 boundaries constitute 75% of total intergenic 
boundaries that we considered in this analysis. The 
boundaries 3R_592 and 3R_913 are also appeared in 
this analysis as they separate the differentially expressed 
genes i.e. C15-CG7956 and dco-SoxlOOB respectively. 

Enhancer-blocking activity of the boundaries identified 
by cdBEST analysis 

To assay predicted boundary elements, we designed a con- 
struct that can be used in Drosophila S2 cells. We call this 
construct as NPG (Neomycin-PE enhancer-GFP) and it 
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has two reporters, neomycin gene for selection of the 
plasmid and GFP to assay the enhancer-blocking 
activity (Figure 4A). We cloned two known boundaries 
{Fab-7 and Fab-8) and 19 predicted boundaries of 
D. melanogaster (see Supplementary Table S8 for hst of 
primes) in this construct and transfected them in S2 cells. 
We also included NG construct, a minimal version of 
NPG that lacks PE enhancer. We selected for stable inte- 
grants by growing the transfected S2 ceUs in G418 con- 
taining cuhure medium for 6 weeks and assayed for 
enhancer blocker activity using Flow cytometry analysis. 
The result show that Fab-7 and Fab-8 function as strong 
enhancer blockers in this assay, as their percentage of ceUs 
that express GFP is comparable to NG (Figure 4B). Of the 
nineteen cdBEST boundaries that were assayed fifteen 
showed enhancer-blocking activity (11 strong & 4 
moderate). In the remaining four, two elements that are 
close to Abd-B promoter had higher GFP fluorescence 
than NPG vector (Figure 4B). It is interesting to note 
that one of the tested boundary, 4_29, is repetitive in 
nature and it also showed strong enhancer-blocking 
activity. 



Figure 2. Boundaries and their repetitive nature in 12 Drosophila 
species. Four different data series, boundaries, repetitive boundaries, 
genome sizes and their repeat contents are plotted in logarithmic 
scale covering all 12 Drosophila species. Repetitive boundaries curve 
closely follows the repeat contents of the genomes indicating a strong 
positive correlation between them (i.e. genomes with higher repeat 
content are more likely to have higher number of repetitive 
boundaries). 



DISCUSSION 

Chromatin domain boundaries are the key regulatory 
elements that help in packaging the genome and regulating 
gene expression as they are known to subdivide the 
genome into independent functional domains (9,58-61). 
Mapping the boundary elements at the genome level, 
therefore, gives a global view of the structural and func- 
tional organization of the genome. Unhke coding or other 
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Figure 3. Predicted boundary elements mark the borders of Polycomb mediated repressed domains. The A and B parts are two representative 
regions of chromosome 3R of Drosophila genome. Upper panels show the predicted boundaries and annotated gene transcripts with scale. 
Lower panels show the binding profiles (ChlP/input ratio) for H3K27me3, PC, PSC and E(Z) proteins obtained from previously published 
ChlP-chip study (37). 
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Figure 4. Predicted boundary elements function as enhancer blockers in Drosophila S2 cells. (A) The enhancer-blocking assay vector, NPG, showing 
the neo resistance gene, the PE enhancer, the GFP reporter gene and the test DNA insertion site. If the test DNA blocks enhancer-promoter 
communication, the stably transfected cells would have a lesser number of GFP positive cells. (B) Flow cytometry analysis was used to determine 
number of GFP positive cells. For each test DNA, percentage of GFP positive cells was calculated and plotted relative to NPG vector transfection. 
Filled black boxes indicate strong enhancer-blocking activity and half-filled ones indicate moderate activity and empty boxes show weak or no 
blocking activity. 



regulatory elements, boundaries do not have prominent 
sequence features that are common to all boundaries. 
Although a boundary can replace another boundary in 
the endogenous locus and rescue the function, they lack 
any apparent primary sequence similarity (62). This, there- 
fore, rule out simple sequence comparison based search 
for such elements in the genome. Biochemical analysis of 
several boundaries elements in Drosophila earlier have led 
to the identification of small sequence motifs that are rec- 
ognition sites of the DN A-binding proteins involved in the 
boundary function. We noticed that all the boundaries 
contain a cluster of such motifs and several of them are 
common in subset of boundaries. Based on this motif clus- 
tering, here we describe a bioinformatics approach, 
cdBEST, to identify boundaries in D. melanogaster 
genome. We were also able to assign common sequence 
features and derive boundary type specific criteria needed 
to predict various subclasses of boundaries that exist in 
Drosophila genome (30,63). 

We used several approaches to vahdate the cdBEST 
predicted boundaries. Using an available genome-wide 
epigenetic profihng data of Polycomh group of proteins, 
we looked if a predicted boundary was seen between the 
repressive and active regions of the genome based on epi- 
genetic mark. Several predicted boundaries, indeed, were 
found in such locations supporting that boundaries 



subdivide genome into functional domains (Figure 3). 
In an independent approach, we also observed that 
around 75% of the genes that flank predicted boundary 
are differentially expressed in one or more tissues/cell 
Hnes. This supports the anticipated role of the predicted 
boundaries in their genomic context. The strongest 
support for the relevance of the cdBEST predicted 
boundary elements comes from the direct demonstration 
of their enhancer-blocking activity. We tested 19 predicted 
boundaries and found that 15 of them function as 
enhancer blockers in Drosophila S2 cells (Figure 4). 
These results allow us to conclude that the predicted 
boundaries are indeed functional elements in the 
Drosophila genome. 

Several approaches have been used recently to address 
chromatin domains and boundaries in Drosophila and 
human. One of these approaches is computational 
search of motifs across the genome that are the sites of 
interaction for individual boundary interacting protein 
(24,40). This approach leaves other factors that may 
co-occupy the boundary region while cdBEST uses 
'motif cluster' approach, where cluster of boundary 
motifs is preferred over single motif. In addition, 
cdBEST includes all the known boundary motifs and 
covers various boundary types that are present in 
Drosophila, which makes the search more comprehensive. 
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cdBEST also has the flexibiUty of changing parameters 
and constraints and can be set to individual motif 
search too. 

Another approach to identify boundaries at genome 
scale is by ChIP based in vivo occupancy of individual 
boundary proteins (64-66) which is experimental equiva- 
lent of the above discussed computation approach. 
A major difficulty in this approach is that majority of the 
boundary interacting proteins in Drosophila are also 
involved in other nuclear functions such as transcriptional 
repressor or activator and, therefore, each site detected for 
interaction may not necessarily reflect boundary (65). 
Furthermore, ChIP experiments are often performed 
using a single cell hne, or mixed tissue such as embryo 
which may not reflect the complexities involved in each 
and every tissue and ceU types (30). To investigate this, 
we compared our cdBEST boundaries with a pubhshed 
Chip data (64). As indicated in Figure 1,11 out of the 12 
predicted boundaries that are present in BX-C had clear 
overlapping ChIP signals. We find that ~55% of the 
cdBEST boundaries have an overlapping ChIP signal for 
one or more boundary proteins. While this is a reasonable 
agreement, it is possible that at least some of the remaining 
45% cdBEST predicted boundaries may be tissue specific 
and may not be bound by proteins in cells where they are 
not functional. Also, since cdBEST used sequence motifs of 
additional boundary proteins, for example, Zw5, Elba and 
F7M, and genome scale ChIP data is not available for these 
proteins, cdBEST can still predict boundaries dependent on 
above mentioned factors. We also noticed several instances 
where a site identified as binding region for a boundary 
protein in vivo (for example, CP 190) does not have the con- 
sensus DNA sequence motif on boundaries (64,67). 
Considering that boundaries can cluster together, some 
may not have direct binding sites and may be recruited 
through protein-protein interactions (67). Such boundaries 
can show up in ChIP based analyses but will be missed in 
recognition motif based predictions. Since cdBEST uses 
experimentally tested sites in the context of their 
boundary function and the scoring system has been 
optimized keeping 'true boundary motifs' context in the 
consideration, it has stringent predictive value. In 
addition, cdBEST has a clear advantage over ChIP 
approach as it can be apphed any other closely related 
genomes. 

The third genome scale boundary search approach has 
been to look at the transition regions in profiling of 
histone modifications or chromatin proteins that define 
chromatin domains (68,69). This approach is novel and 
most recent with the limitation that it is human specific 
and does not offer any tool which can be apphed to 
other genomes (69). Although we have cdBEST for 
D. melanogaster, since it uses motif based approach it 
offers a tool which can be optimized in many other 
closely related genomes as seen from our boundary search 
results in non-melanogaster drosophihds. A related study 
that used genome scale profihng of more than 50 chro- 
matin proteins shows five principal chromatin types that 
are present in Drosophila Kc-167 cells (68). In this study 
8428 chromatin domains have been identified with the 
median size of 6.5 kb. Their data provides a fair idea of 



chromatin domains that are present in Drosophila cells. 
We explored how frequently the transition regions of 
these chromatin domains coincided with the boundaries 
defined by cdBEST. Of the 4576 cdBEST boundaries, 
21% (977) overlap within 2kb sequence that was used 
as the transition region, while the rest of the boundaries 
are found be located inside these domains. We took a 
close look at the BX-Complex region that has a series 
of well identified and studied boundaries separating inde- 
pendent regulatory domains (11). While aU the BX-C 
boundaries were mapped by cdBEST and vahdated by 
enhancer-blocking assays (Figure 4B), in the 'five princi- 
pal type chromatin' study the entire BX-Complex is 
marked as a single BLUE chromatin that corresponds 
to PcG chromatin. While this is in agreement with the 
chromatin state of Kcl67 cell hne, where BX-C genes are 
repressed by PcG proteins, it does not reflect the dynamic 
and cell type specific redistribution of chromatin types. 
Since cdBEST uses the primary sequence alone as the 
input, it extracts all possible boundaries that depend on 
the motifs used even if they may not functionally exist in 
a particular ceU type or state. Such an inclusive approach 
gives the global picture the genome organization. 

Any whole genome analysis is not complete, specially, 
in higher eukaryotes, unless it takes into account the 
repetitive elements. Several hues of studies indicate role 
of repetitive DNA in boundary function (16,70-72). In 
cdBEST based analysis, we also find several boundaries 
that occur multiple times in the genome and majority of 
them turned out to be associated with transposable 
elements (Supplementary Table S5). Prior to this 
analysis, gypsy and Idefix were the only transposable 
elements in Drosophila known to have boundary 
function and further experiments may identify many 
such elements and link these transposable elements to 
regulatory function. Boundary activity associated with 
repetitive sequences is of special significance as it 
provides means to regulate number of loci with fewer 
protein factors in a coordinated manner (72). 

In conclusion, cdBEST is a rehable tool to detect 
boundaries at whole genome scale in D. melanogaster 
and many other drosophihds. With the help of cdBEST, 
we can annotate a significant portion of the genome 
('^3%o) as boundary elements. As majority of the 
boundary interacting proteins are conserved among 
insects (73,74), cdBEST can be easily adapted to other 
insect genomes to search boundary element sequences 
and annotate their genomes for boundaries. With the 
increasing number of species whose genome sequences 
are being made available, for example, i5k project of 
5000 insects and other arthropods (75), tools hke 
cdBEST win be helpful to analyse and understand 
features of genome organization and function. 
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